Star Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Objective

The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. Star Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observations

  1. type_of_meal_plan, room_type_reserved, market_segment_type and booking_status are object variables; the rest are numerical variables.
  2. avg_price_per_room is a float variable.
  3. The object variables (type_of_meal_plan, room_type_reserved, market_segment_type, booking_status) need to be converted to categorical variables.

Observations

  1. There are 56926 rows and 18 columns.

Check whether there are any duplicate rows.

Observations

  1. There are 14350 duplicate rows, about 25% of the data. Hence, we can delete the duplicates.

Observation

  1. Now the new dataframe contains 42576 rows and 18 columns.
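
The duplicate check and removal described above can be sketched as follows. This is a minimal illustration on a toy dataframe, not the notebook's original cell; the dataframe name `bookings` and the sample values are assumptions, and only the column names come from the data.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are made up for illustration).
bookings = pd.DataFrame(
    {
        "lead_time": [10, 10, 45, 45, 120],
        "avg_price_per_room": [95.0, 95.0, 110.5, 110.5, 80.0],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
        ],
    }
)

# Count rows that exactly repeat an earlier row.
n_duplicates = int(bookings.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduped = bookings.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}")
print(f"shape after dropping duplicates: {deduped.shape}")
```

Running `deduped.duplicated().sum()` afterwards is a quick way to verify that no duplicate rows remain.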

Verifying for Duplicate Rows.

Observations

  1. There are no duplicate rows.

Observations

  1. All columns have 42576 observations indicating no missing values.
  2. type_of_meal_plan, room_type_reserved, market_segment_type and booking_status are object variables and need to be converted to categorical variables.

Observation

  1. lead_time ranges widely, from 0 to 521 days.
  2. avg_price_per_room ranges from 0 to 540 euros, indicating that some rooms were available free of charge.

By default, the describe() function shows the summary of numeric variables only. Let's check the summary of non-numeric variables.

Observations

  1. The variable type_of_meal_plan has 4 unique values.
  2. The variable room_type_reserved has 7 unique values.
  3. The variable market_segment_type has 5 unique values.
  4. The variable booking_status has 2 unique values.

Observations

  1. 33% of overall bookings are canceled.
  2. Most of the market segment designations are online.
  3. More than 60% of the customers preferred Room_type 1. The least preferred room is Room_type 3.
  4. About 70% of the customers preferred Meal Plan 1. Some have no preference.

Check for unique values in the column

Observation

  1. The mean of avg_price_per_room is greater than the median (50% value), so the distribution is positively skewed.
  2. lead_time ranges widely, from 0 to 521 days.
  3. avg_price_per_room ranges from 0 to 540 euros, indicating that some rooms were available free of charge.
  4. The maximum no_of_week_nights is 17. The maximum no_of_weekend_nights is 8.

Observation

  1. There are no missing values in the dataframe.

Convert the object variables to Categorical Variables.

Observations.

The object variables are converted to categorical variables.
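
A minimal sketch of the conversion step, assuming a dataframe named `bookings` (a hypothetical name) that holds the four object columns listed in the observations; the sample values are made up.

```python
import pandas as pd

# Toy stand-in for the bookings data (values are made up for illustration).
bookings = pd.DataFrame(
    {
        "type_of_meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 2"],
        "room_type_reserved": ["Room_Type 1", "Room_Type 4", "Room_Type 1"],
        "market_segment_type": ["Online", "Offline", "Online"],
        "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled"],
        "lead_time": [30, 5, 210],
    }
)

# Columns named in the observations as object variables.
categorical_columns = [
    "type_of_meal_plan",
    "room_type_reserved",
    "market_segment_type",
    "booking_status",
]

# Convert each listed column to the pandas "category" dtype.
bookings = bookings.astype({column: "category" for column in categorical_columns})

print(bookings.dtypes)
```

Listing the columns explicitly avoids accidentally converting a text column that should stay as free text.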

Exploratory Data Analysis (EDA)

Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Let us explore numerical variables first

Observations on no_of_adults

Observations

  1. no_of_adults = 2 in most of the cases.

Observations on children

Observation

  1. In most of the cases no_of_children = 0.
  2. The distribution is positively skewed.
  3. There are 5 outliers

Observations on no_of_weekend_nights

Observation

  1. There are 3 outliers.

Observations on required_car_parking_space

Observations on lead_time

Observations

  1. The distribution is positively skewed.
  2. There are many outliers.

Observations on arrival_month

Observations

  1. Most of the bookings are in August.
  2. The fewest bookings are in January and November.

Observations on repeated_guest

Observation

  1. Most of the customers are repeated guests.

Observations on no_of_previous_cancellations

Observation

Most of the customers did not cancel the booking.

Observations on no_of_previous_bookings_not_canceled

Observation

no_of_previous_bookings_not_canceled = 0 in most of the cases.

Observations

  1. The number of special requests ranges between 0 and 3.

Now let's explore the categorical variables.

Observation on type_of_meal_plan

Observation

  1. Most of the customers prefer Meal Plan 1.
  2. Some customers don't have any preference.

Observation on room_type_reserved

Observation

  1. Most of the customers prefer Room Type 1.
  2. The least preferred room is Room Type 3, followed by Room Type 7.

Observation on market_segment_type

Observation

  1. Most of the customers come through the 'online' market_segment_type.
  2. The least common segment is Aviation.

Observation on Booking_status

Observation

  1. Most of the customers have not canceled their booking.

Bivariate Analysis

Plotting bivariate relationships to understand how the variables interact with each other

Replace booking_status by numerical values

Observation

  1. no_of_previous_cancellations shows a high correlation with no_of_previous_bookings_not_canceled (0.58).
  2. repeated_guest shows a high correlation with no_of_previous_bookings_not_canceled (0.56).
  3. It is important to note that correlation does not imply causation.
  4. Some variable pairs are negatively correlated.
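
The encoding and correlation steps can be sketched as below. The label strings `"Canceled"` and `"Not_Canceled"`, the 0/1 mapping and the toy values are assumptions for illustration; the correlations quoted above come from the full dataset, not from this sample.

```python
import pandas as pd

# Toy stand-in for a few of the numeric columns plus the target.
bookings = pd.DataFrame(
    {
        "repeated_guest": [0, 0, 1, 1, 0, 1],
        "no_of_previous_cancellations": [0, 0, 1, 2, 0, 0],
        "no_of_previous_bookings_not_canceled": [0, 0, 3, 5, 0, 2],
        "booking_status": [
            "Canceled", "Not_Canceled", "Not_Canceled",
            "Not_Canceled", "Canceled", "Not_Canceled",
        ],
    }
)

# Replace the target labels by numbers so the column can enter the correlation matrix.
bookings["booking_status"] = bookings["booking_status"].map(
    {"Not_Canceled": 0, "Canceled": 1}
)

# Pairwise Pearson correlation between all numeric columns.
corr = bookings.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a heatmap function to visualize the strongly positive and negative pairs.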

Bivariate Scatter Plots

Observations

  1. no_of_previous_cancellations shows a high correlation with no_of_previous_bookings_not_canceled (0.58).
  2. repeated_guest shows a high correlation with no_of_previous_bookings_not_canceled (0.56).
  3. It is important to note that correlation does not imply causation.
  4. Some variable pairs are negatively correlated.

Relationship between type_of_meal_plan and room_type_reserved

Observation

  1. Room_type 1 is mostly preferred followed by Room_type 4
  2. Customers who prefer Meal Plan 1 mostly use Room_type 1

Relationship between type_of_meal_plan and market_segment_type

Observation

  1. Most of the customers prefer Meal Plan 1.
  2. Meal Plan 1 customers mostly book through the online market_segment_type.
  3. Customers who have not selected any meal plan mostly book through the online market_segment_type.

Relationship between type_of_meal_plan and booking_status

Observation

  1. Most of the bookings are not_canceled.
  2. Customers who prefer Meal Plan 1 have more cancellations than other meal plan groups.

Relationship between room_type_reserved and market_segment_type

Observations

  1. Customers mostly book through the Online market segment type.
  2. Most of the customers choose Room Type 1 and the online market segment type.

Relationship between market_segment_type and booking_status

Observation

  1. Customers in the Online market segment had more cancellations than the others.
  2. There are more customers with booking_status = not_canceled.

Relationship between room_type_reserved and booking_status

Observation

  1. Customers who prefer Room_type 1 had more cancellations than the others.
  2. There are more customers with booking_status = not_canceled.

Relationship between no_of_adults and avg_price_per_room

Observation

  1. The more the number of adults, the higher the average price of the room.
  2. The price is highest when no_of_adults = 2.

Relationship between no_of_children and booking_status

Observation

  1. The chance of a booking being canceled is lower when there are more children.

Relationship between no_of_weekend_nights and booking_status

Observation

  1. The chance of cancellation increases with no_of_weekend_nights.

Relationship between no_of_week_nights and booking_status

Observation

  1. Cancellations are more frequent when more no_of_week_nights are chosen.

Relationship between required_car_parking_space and booking_status

Observation

  1. The chance of a booking being canceled is higher if required_car_parking_space = 0.

Relationship between lead_time and booking_status

Observation

  1. The longer the lead_time, the higher the chance of cancellation.

Relationship between arrival_year and booking_status

Observation

  1. There are more cancellations in 2019.

Relationship between arrival_month and booking_status

Observation

  1. There are fewer cancellations in January and December.

Relationship between no_of_previous_cancellations and booking_status

Observation

  1. There seem to be more cancellations as no_of_previous_cancellations decreases.

Relationship between no_of_previous_bookings_not_canceled and booking_status

Observation

  1. Most of the bookings are not canceled.

Relationship between no_of_special_requests and booking_status

Observation

  1. More bookings have booking_status = canceled when no_of_special_requests is low.

Relationship between booking_status and avg_price_per_room

Observation

  1. The avg_price_per_room is higher for bookings with booking_status = canceled.

Relationship between type_of_meal_plan and booking_status

Observation

  1. Cancellations are highest for bookings where Meal Plan 2 is chosen and lowest for Meal Plan 3.

Relationship between market_segment_type and booking_status

Observation

  1. The highest number of cancellations is seen for market_segment_type = Online and the lowest for market_segment_type = Complementary.

Relationship between room_type_reserved and booking_status

Observation

  1. The highest number of cancellations is seen when Room Type 6 is chosen.

Correlation between type_of_meal_plan, avg_price_per_room and room_type_reserved

Observation

  1. avg_price_per_room of Room Type 2 is lower for all the meal plans.
  2. Meal Plan 2 is expensive compared to the other meal plans.
  3. Room Type 6 along with Meal Plan 2 is the most expensive.

Correlation between type_of_meal_plan, avg_price_per_room and market_segment_type

Observation

  1. The average price of the room is lower for market_segment_type = Complementary for all the meal plans.
  2. The average price of the room is lower for market_segment_type = Offline, and Meal Plan 3 is the most expensive.
  3. type_of_meal_plan = Meal Plan 1 and type_of_meal_plan = Not Selected are economical for all market_segment_type values.

Correlation between type_of_meal_plan, avg_price_per_room and booking_status

Observation

  1. The average price of the room is highest for bookings with Meal Plan 3 and booking_status = canceled.
  2. For bookings with booking_status = not_canceled and Meal Plan 1, the average price of the room is lower.

Correlation between room_type_reserved, avg_price_per_room and market_segment_type

Observation

  1. Room Type 7 is the most expensive for all market segment types.
  2. Room Type 2 is cheaper for all market segment types.

Correlation between room_type_reserved, avg_price_per_room and booking_status

Observation

  1. For all room types, there are more canceled than not_canceled bookings.

Correlation between market_segment_type, avg_price_per_room and booking_status

Observation

  1. booking_status = not_canceled is lower than booking_status = canceled for all market segment types.

Correlation between market_segment_type, no_of_previous_cancellations and booking_status

Observation

  1. no_of_previous_cancellations is higher when booking_status is not_canceled and market_segment_type is Complementary or Corporate.
  2. no_of_previous_cancellations is lowest for all the market segments when booking_status is canceled.

Correlation between type_of_meal_plan, no_of_previous_bookings_not_canceled and booking_status

Observation

  1. no_of_previous_bookings_not_canceled is lowest for booking_status = canceled for all the meal plans.
  2. no_of_previous_bookings_not_canceled is highest for Meal Plan 3 and booking_status = not_canceled.

Correlation between type_of_meal_plan, no_of_previous_cancellations and market_segment_type

Observation

  1. no_of_previous_cancellations is lowest for market_segment_type = Online and Offline for all the meal plans.
  2. no_of_previous_cancellations is highest for market_segment_type = Complementary and Corporate for Meal Plan 1.

Correlation between type_of_meal_plan, no_of_previous_cancellations and room_type_reserved

Observation

  1. There are more special requests for Room Type 7 for all the meal plans.
  2. The fewest special requests are for Meal Plan 3.

Distribution of avg_price_per_room across each room_type_reserved - violin plot

Observation

  1. Room Type 7 is the most expensive.
  2. Room Type 2 is cheaper, followed by Room Type 1.

Distribution of avg_price_per_room across each type_of_meal_plan - violin plot

Observation

  1. The average price of the room is lower if Meal Plan 3 is selected.
  2. The average price of the room is higher if Meal Plan 2 is selected.

Questions

1. What are the busiest months in the hotel?

Observation

  1. August is the busiest month in the hotel, with more than 5000 bookings.
  2. The fewest bookings were made in November and January.

2. Which market segment do most of the guests come from?

Observation

  1. Most of the customers come from the Online market segment.
  2. The fewest customers come from the Aviation market segment.

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

Observation

  1. The average price of the room is highest for the Online market segment (~120 euros).
  2. The average price of the room is lowest for the Complementary market segment (~2 euros).
  3. The difference between the prices for the Aviation and Online market segments is ~18 euros.
  4. The difference between the prices for the Corporate and Offline market segments is the smallest (~5 euros).
  5. The difference between the prices for the Complementary and Online market segments is the largest (~118 euros).
  6. The difference between the prices for the Offline and Online market segments is ~36 euros.
  7. The difference between the prices for Complementary and Aviation is ~100 euros.
  8. The difference between the prices for Complementary and Corporate is ~80 euros.
  9. The difference between the prices for Complementary and Offline is ~83 euros.
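
The segment-wise price comparison can be reproduced with a group-by followed by a table of pairwise differences. The sketch below uses made-up prices, so its numbers will not match the figures above; the dataframe name `bookings` is an assumption.

```python
import pandas as pd

# Toy stand-in with made-up prices (in euros).
bookings = pd.DataFrame(
    {
        "market_segment_type": [
            "Online", "Online", "Offline", "Offline", "Corporate", "Complementary",
        ],
        "avg_price_per_room": [120.0, 110.0, 90.0, 80.0, 85.0, 0.0],
    }
)

# Mean room price per market segment.
mean_price = bookings.groupby("market_segment_type")["avg_price_per_room"].mean()

# Pairwise differences: entry (row, column) is mean(row) - mean(column).
prices = mean_price.to_numpy()
price_gap = pd.DataFrame(
    prices[:, None] - prices[None, :],
    index=mean_price.index,
    columns=mean_price.index,
)

print(mean_price)
print(price_gap)
```

Reading the table row by row gives every pairwise gap at once instead of computing each difference by hand.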

4. What percentage of bookings are canceled?

Observation

  1. 34% of the bookings are canceled.
  2. The remaining 66% of the bookings are not canceled.

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

Observation

  1. From the graph, it is observed that ~2% of the repeating guests cancel their bookings.
  2. More than 97% of the repeating guests do not cancel their bookings.
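
One way to compute the cancellation share per guest type is a row-normalized cross-tabulation. This is a sketch on toy data, so the percentages differ from the ones read off the graph; the label strings are assumptions.

```python
import pandas as pd

# Toy stand-in: repeated_guest is 1 for repeating guests, 0 otherwise.
bookings = pd.DataFrame(
    {
        "repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
            "Not_Canceled", "Not_Canceled", "Not_Canceled", "Canceled",
        ],
    }
)

# normalize="index" makes each row sum to 1, i.e. shares within each guest type.
cancel_share = (
    pd.crosstab(
        bookings["repeated_guest"], bookings["booking_status"], normalize="index"
    )
    * 100
)

print(cancel_share.round(1))
```

The row for `repeated_guest = 1` answers the question directly: the `Canceled` column is the percentage of repeating guests who cancel.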

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Observation

  1. The fewer the no_of_special_requests, the higher the chance of a booking being canceled.
  2. The more the no_of_special_requests, the higher the chance of a booking not being canceled.

Observation based on EDA

  1. Most of the customers prefer Meal Plan 1, followed by Mean plan. Rest of customers who have not selected a meal preference.
  2. Cancellations are lowest for Meal Plan 3 and highest for Meal Plan 2.
  3. The most preferred room type is type 1, followed by types 4, 6 and 5. The least preferred room type is type 3, followed by type 7.
  4. The percentage of cancellations is high for type 6. There are no cancellations for types 7 and 3.
  5. Most customers are online bookers, followed by offline and corporate. The least common segment is aviation.
  6. There are no cancellations in the Aviation, Complementary and Corporate market segments.
  7. There are very few cancellations in the offline segment, but close to 65% cancellations in the online segment.

Data Preprocessing

Missing Value Treatment

Check for missing value

Observation

  1. There are no missing values. Hence, there is no need for missing value treatment.

Feature Engineering

Outlier Treatment

Observation

  1. There are lower outliers in avg_price_per_room.
  2. There are no outliers for arrival_date, arrival_year, arrival_day.
  3. The other numerical variables have upper outliers.
  4. Since these are genuine values, no outlier treatment is required.
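
A common way to flag lower and upper outliers is the 1.5 x IQR rule used by box plots. The source does not state which rule was applied, so the sketch below is an assumption; the helper name `count_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, whisker: float = 1.5) -> dict:
    """Count values below Q1 - whisker*IQR and above Q3 + whisker*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - whisker * iqr
    upper_bound = q3 + whisker * iqr
    return {
        "lower": int((values < lower_bound).sum()),
        "upper": int((values > upper_bound).sum()),
    }


# Toy lead_time values with one very long lead time.
lead_time = pd.Series([0, 5, 10, 12, 15, 18, 20, 25, 30, 400])
counts = count_iqr_outliers(lead_time)
print(counts)
```

Counting outliers per side makes it easy to see which variables have only upper outliers and which also have lower ones.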

EDA

  1. There are 42576 rows and 18 columns.

Let us explore numerical variables first

no_of_adults

Observations

no_of_adults = 2 in most of the cases.

Observations on children

Observation

In most of the cases no_of_children = 0. The distribution is positively skewed. There are 5 outliers

Observations on no_of_weekend_nights

There are 3 outliers.

Observations on required_car_parking_space

Observations on lead_time

Observations

The distribution is positively skewed. There are many outliers.

Observations on arrival_month

Observations

Most of the bookings are in August. The fewest bookings are in January and November.

Observations on repeated_guest

Observation

Most of the customers are repeated guests.

Observations on no_of_previous_cancellations

Observation

Most of the customers did not cancel the booking.

Observations on no_of_previous_bookings_not_canceled

Observation

no_of_previous_bookings_not_canceled = 0 in most of the cases.

Observation

The number of special requests ranges between 0 and 3.

Now let's explore the categorical variables.

Observation on type_of_meal_plan

Observation

Most of the customers prefer Meal Plan 1. Some customers don't have any preference.

Observation on room_type_reserved

Observation

Most of the customers prefer Room Type 1. The least preferred room is Room Type 3, followed by Room Type 7.

Observation on market_segment_type

Most of the customers come through the 'online' market_segment_type. The least common segment is Aviation.

Observation on Booking_status(Converted to numerical)

Observation

Most of the customers have not canceled their booking.

Relationship between type_of_meal_plan and room_type_reserved

Observation

Room_type 1 is mostly preferred, followed by Room_type 4. Customers who prefer Meal Plan 1 mostly use Room_type 1.

Relationship between type_of_meal_plan and market_segment_type

Observation

  1. Most of the customers prefer Meal Plan 1.
  2. Meal Plan 1 customers mostly book through the online market_segment_type.
  3. Customers who have not selected any meal plan mostly book through the online market_segment_type.

Relationship between type_of_meal_plan and booking_status

Observation

  1. Most of the bookings are not_canceled.
  2. Customers who prefer Meal Plan 1 have more cancellations than other meal plan groups.

Relationship between room_type_reserved and market_segment_type

Observations

Customers mostly book through the Online market segment type. Most of the customers choose Room Type 1 and the online market segment type.

Relationship between market_segment_type and booking_status

Observation

Customers in the Online market segment had more cancellations than the others. There are more customers with booking_status = not_canceled.

Relationship between room_type_reserved and booking_status

Observation

Customers who prefer Room_type 1 had more cancellations than the others. There are more customers with booking_status = not_canceled.

Relationship between no_of_adults and avg_price_per_room

Observation

The more the number of adults, the higher the average price of the room. The price is highest when no_of_adults = 2.

Relationship between no_of_children and booking_status

Observation

The chance of a booking being canceled is lower when there are more children.

Relationship between no_of_weekend_nights and booking_status

Observation

The chance of cancellation increases with no_of_weekend_nights.

Relationship between no_of_week_nights and booking_status

Observation

Cancellations are more frequent when more no_of_week_nights are chosen.

Relationship between required_car_parking_space and booking_status

Observation

The chance of a booking being canceled is higher if required_car_parking_space = 0.

Relationship between lead_time and booking_status

Observation

The longer the lead_time, the higher the chance of cancellation.

Relationship between arrival_year and booking_status

Relationship between no_of_special_requests and booking_status

Observation

More bookings have booking_status = canceled when no_of_special_requests is low.

Relationship between booking_status and avg_price_per_room

Observation

The avg_price_per_room is higher for bookings with booking_status = canceled.

Relationship between type_of_meal_plan and booking_status

Observation

The highest number of cancellations is seen for market_segment_type = Online and the lowest for market_segment_type = Complementary.

Relationship between room_type_reserved and booking_status

Observation

The highest number of cancellations is seen when Room Type 6 is chosen.

Correlation between type_of_meal_plan, avg_price_per_room and booking_status

Observation

Bookings where Meal Plan 3 is chosen have the maximum number of cancellations.

Correlation between room_type_reserved, avg_price_per_room and booking_status

Observation

For all room types, there are more canceled than not_canceled bookings.

Correlation between market_segment_type, avg_price_per_room and booking_status

Observation

booking_status = not_canceled is lower than booking_status = canceled for all market segment types.

Correlation between market_segment_type, no_of_previous_cancellations and booking_status

Observation

no_of_previous_cancellations is higher when booking_status is not_canceled and market_segment_type is Complementary or Corporate. no_of_previous_cancellations is lowest for all the market segments when booking_status is canceled.

Correlation between type_of_meal_plan, no_of_previous_bookings_not_canceled and booking_status

Observation

no_of_previous_bookings_not_canceled is lowest for booking_status = canceled for all the meal plans. no_of_previous_bookings_not_canceled is highest for Meal Plan 3 and booking_status = not_canceled.

Correlation between type_of_meal_plan, no_of_previous_cancellations and market_segment_type

Observation

no_of_previous_cancellations is lowest for market_segment_type = Online and Offline for all the meal plans. no_of_previous_cancellations is highest for market_segment_type = Complementary and Corporate for Meal Plan 1.

Correlation between type_of_meal_plan, no_of_previous_cancellations and room_type_reserved

Observation

There are more special requests for Room Type 7 for all the meal plans. The fewest special requests are for Meal Plan 3.

Checking Multicollinearity

Building a Logistic Regression model

Building the model

Model Evaluation Criterion

Model can make wrong predictions as:

  1. Predicting that a booking will be canceled when in reality it is not canceled.
  2. Predicting that a booking will not be canceled when in reality it is canceled.

Which case is more important?

Both the cases are important.

How to reduce this loss, i.e. how do we reduce False Negatives?

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
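
A minimal sketch of such a helper, written in plain Python so the arithmetic behind each metric is visible. The function name `classification_scores` is hypothetical, and the original notebook most likely relied on scikit-learn's metric functions instead; label 1 is assumed to mean "canceled".

```python
def classification_scores(actual, predicted):
    """Return confusion-matrix counts and the usual metrics for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # cancellations caught
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # cancellations missed
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # correct non-cancellations

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Toy labels: 1 = canceled, 0 = not canceled.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = classification_scores(actual, predicted)
print(scores)
```

Calling the same helper on the training and test predictions of every model keeps the comparison consistent.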

Logistic Regression(with Sklearn library)

Checking model performance on training set

Checking the performance on test set

Observation

  1. The training and testing Recall scores are ~0.62.
  2. The Recall values on the test and train sets are comparable.
  3. This logistic regression model shows comparable performance on the train and test data, so it generalizes well.
  4. To identify significant variables, statsmodels is recommended.

Logistic Regression (with statsmodels library)

Observation

  1. A negative coefficient value indicates that the probability of a booking being canceled decreases with an increase in the attribute value, and vice versa for positive coefficient values.
  2. Any variable whose p-value is less than 0.05 is considered significant.
  3. These variables might exhibit multicollinearity, which might affect the p-values.

There is a need to remove the multicollinearity, which affects the p-values.

Multicollinearity

Observation

  1. market_segment_type_Corporate, market_segment_type_Offline and market_segment_type_Online exhibit high multicollinearity, but market_segment_type_Complementary does not.
  2. To drop a categorical variable, all of its levels should have a high VIF. If even one level doesn't have a high VIF, you can't drop the other levels.
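
For reference, the VIF of a feature is 1 / (1 - R^2), where R^2 comes from regressing that feature on all the other features. The sketch below computes it with NumPy on synthetic data; it is a hand-rolled illustration (with an intercept added to each regression), not the statsmodels call the notebook presumably used, and the function name is hypothetical.

```python
import numpy as np


def variance_inflation_factors(features: np.ndarray) -> np.ndarray:
    """VIF per column: regress each column on the others (plus an intercept)."""
    n_rows, n_cols = features.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic features: x3 is almost a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs.round(2))
```

The two nearly identical columns get very large VIFs while the independent one stays close to 1, which mirrors the VIF > 10 rule of thumb applied in this section.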

Observation

  1. After removing market_segment_type_Corporate, we see that there is no variable with VIF>10.
  2. Accuracy = 0.79
  3. Recall = 0.61
  4. Precision = 0.73
  5. F1 = 0.67

Observation

Some of the variables have p-value > 0.05. These variables can be dropped.

Dropping market_segment_type_Complementary

Observation

There are still variables whose p-value >0.05. Running a loop to drop variables with higher p-value.

Observation

All the above variables have p-value <0.05

Converting coefficients to odds

Coefficient Interpretations

  1. Keeping all the other features constant, a unit change in no_of_children will increase the odds of a booking being canceled by 110%.
  2. Keeping all the other features constant, a unit change in required_car_parking_space will change the odds of a booking being canceled by a factor of about 0.22, i.e. a 77.9% decrease in odds.
  3. Similarly, for all the attributes.
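
The conversion behind these interpretations is odds ratio = exp(coefficient) and percentage change in odds = (odds ratio - 1) x 100. The sketch below uses an illustrative coefficient of -1.51, which is an assumption chosen to give an odds ratio near 0.22 and is not taken from the model output.

```python
import math


def odds_change(coefficient: float) -> tuple[float, float]:
    """Return (odds ratio, percentage change in odds) for a logit coefficient."""
    odds_ratio = math.exp(coefficient)
    percent_change = (odds_ratio - 1.0) * 100.0
    return odds_ratio, percent_change


# Hypothetical coefficient, for illustration only.
hypothetical_coefficient = -1.51
ratio, percent = odds_change(hypothetical_coefficient)
print(f"odds ratio: {ratio:.3f}, change in odds: {percent:.1f}%")
```

A negative percentage means the odds of cancellation fall as the feature increases by one unit; a positive one means they rise.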

Checking model performance on the training set

ROC-AUC curve on training set

Model Performance Improvement

Let's check whether the F1 score can be improved. This can be done by changing the model threshold value according to the AUC-ROC curve.

Optimal threshold using AUC-ROC curve
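
A common criterion for picking a threshold from the ROC curve is to maximize the true positive rate minus the false positive rate (Youden's J). The source does not say which criterion produced its threshold, so the sketch below is an assumption; the function name and the toy probabilities are hypothetical.

```python
import numpy as np


def youden_threshold(actual, probabilities) -> float:
    """Return the probability cut-off that maximizes TPR - FPR."""
    actual = np.asarray(actual)
    probabilities = np.asarray(probabilities)
    n_positive = (actual == 1).sum()
    n_negative = (actual == 0).sum()

    best_threshold, best_score = 0.5, -np.inf
    for threshold in np.unique(probabilities):
        predicted = probabilities >= threshold
        tpr = (predicted & (actual == 1)).sum() / n_positive
        fpr = (predicted & (actual == 0)).sum() / n_negative
        if tpr - fpr > best_score:
            best_threshold, best_score = float(threshold), tpr - fpr
    return best_threshold


# Toy predicted probabilities of cancellation, sorted for readability.
probabilities = [0.05, 0.10, 0.15, 0.20, 0.25, 0.31, 0.33, 0.35, 0.40, 0.50, 0.60, 0.70]
actual = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
best = youden_threshold(actual, probabilities)
print(f"threshold maximizing TPR - FPR: {best:.2f}")
```

Predictions are then recomputed as `probability >= best` instead of using the default cut-off of 0.5.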

Checking model performance on training set

Observation

Model performance has significantly improved. Recall increased from 0.613306 to 0.806, while precision and accuracy decreased.

Let's use Precision-Recall curve and see if we can find a better threshold

At the threshold of 0.42, we get balanced recall and precision. Selecting a value around 0.40

Checking the model performance on training set

Observation

  1. Recall improved from 0.61 to 0.71
  2. Model is giving a better performance with 0.3068 threshold found using AUC-ROC curve.

Model Performance Summary

Let's check the performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Using model with threshold value = 0.306

Using model with threshold = 0.40

Model Performance Summary

Final Model Summary

  1. All the models are giving a generalized performance on training and test data.
  2. The highest recall is 0.806752 on the training data.
  3. Using the model with threshold = 0.306 will give high recall and lower precision. This model will help
  4. Using the model with threshold = 0.40 will give a good balance between precision and recall.

Building a Decision Tree model

  1. We will build our model using the DecisionTreeClassifier function.
  2. Using the default 'gini' criterion to split. Other options include 'entropy'.

Scoring our Decision Tree

Does extremely well on training data.

Model can make wrong predictions as:

  1. Predicting that a booking will be canceled when in reality it is not canceled.
  2. Predicting that a booking will not be canceled when in reality it is canceled.

Which case is more important?

  1. Predicting that a booking will not be canceled when in reality it is canceled, as this might result in a huge loss.

How to reduce this loss, i.e. how do we reduce False Negatives?

  1. Recall gives the ratio of True Positives to Actual Positives, so high recall means fewer False Negatives, i.e. lower chances of predicting a cancellation as a non-cancellation.
  2. Recall should be maximized: the greater the recall, the higher the chances of identifying bookings that will be canceled.

Confusion Matrix

Visualize the Decision Tree

Observation

  1. The importance of the attributes decreases from top to bottom.
  2. lead_time, avg_price_per_room, no_of_special_requests and arrival_date have higher importance.

As per the decision tree model, lead_time is the most important variable for predicting cancellations.

The above tree is very complex, such a tree usually overfits.

Reduce the OverFit

  1. The deeper the tree, the more complex the model: it has more splits and captures more information, which is one of the causes of overfitting.
  2. Now, let's try to limit the tree depth to 3.
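
The effect of limiting the depth can be sketched with scikit-learn on synthetic data. The data from `make_classification` is a stand-in for the prepared booking features, so the scores printed here are not the ones reported in this document.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking features and the canceled / not-canceled target.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Fully grown tree versus a tree pre-pruned to depth 3.
full_tree = DecisionTreeClassifier(criterion="gini", random_state=1)
full_tree.fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
shallow_tree.fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    train_recall = recall_score(y_train, tree.predict(X_train))
    test_recall = recall_score(y_test, tree.predict(X_test))
    print(
        f"{name}: depth={tree.get_depth()}, "
        f"train recall={train_recall:.2f}, test recall={test_recall:.2f}"
    )
```

A large gap between training and test recall for the unrestricted tree is the overfitting signal that motivates pruning.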

Do we need to prune the tree?

Confusion Matrix - dTree with depth limited to 3

  1. Recall on training data has reduced from 0.99 to 0.74 which is an improvement as the model is not overfitting.

Let's visualize the decision tree

Observation

  1. Here we see that some important features, such as arrival date, which was in the 4th position, now have an importance of zero. This is a shortcoming of pre-pruning.
  2. It is bad to have a very low depth because the model may underfit.
  3. Let's choose the pre-pruning parameters using grid search.

Reducing the OverFit

Using GridSearch for hyperparameter tuning of the tree model
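
A minimal sketch of the grid search with scikit-learn's `GridSearchCV`, scored on recall. The parameter grid below is hypothetical, since the values searched in the original notebook are not shown, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)

# Hypothetical grid of pre-pruning parameters.
param_grid = {
    "max_depth": [2, 3, 5, 8],
    "min_samples_leaf": [1, 5, 20],
    "max_leaf_nodes": [5, 10, 20],
}

# 5-fold cross-validated search that keeps the combination with the best recall.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)

tuned_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

`best_estimator_` is refit on the full training data, so it can be evaluated on the test set directly.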

check the model performance on training data

checking the model performance on test data

This model gives a generalized performance on both training and test data.

  1. We got a simplified tree after pre-pruning.

Cost Complexity Pruning

  1. This is another method to control the size of a tree, parameterized by the cost complexity parameter, ccp_alpha.
  2. Now we will train a decision tree using the effective alphas.
  3. The last value in ccp_alphas is the alpha value that prunes the entire tree, leaving the tree, dtree_classifier[-1], with one single node.
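
The pruning path and the per-alpha trees can be sketched as follows, using synthetic data. Choosing the best tree by test-set recall is an assumption based on the "best model in terms of recall" step; a separate validation set would be the more careful choice. The list name `trees` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the booking data.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Effective alphas at which subtrees are pruned away, in increasing order.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point error.
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# One tree per effective alpha.
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)
    for alpha in ccp_alphas
]

# The last alpha prunes the whole tree to its root, so leave that tree out.
candidates = trees[:-1]
test_recalls = [recall_score(y_test, tree.predict(X_test)) for tree in candidates]
best_tree = candidates[int(np.argmax(test_recalls))]

print(f"nodes in last tree: {trees[-1].tree_.node_count}")
print(f"best ccp_alpha: {best_tree.ccp_alpha:.5f}, nodes: {best_tree.tree_.node_count}")
```

Plotting accuracy or recall against `ccp_alphas` for the training and test sets shows where additional pruning starts to hurt.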

Accuracy vs Alpha for training and testing sets

best model in terms of recall

Confusion Matrix of the post-pruned decision tree

Model Performance Comparison and Conclusions

Observation

  1. By definition, high recall means fewer false negatives, i.e. lower chances of predicting a cancellation as a non-cancellation. Recall should be maximized: the greater the recall, the higher the chances of identifying bookings that will be canceled.
  2. Here, the highest recall value is obtained for Logistic Regression Stats Model - 0.306 Threshold.
  3. The highest accuracy value is obtained for the decision tree with post-pruning.
  4. The difference between the recall values of Logistic Regression Stats Model - 0.306 Threshold and the decision tree with post-pruning is 4%, so the recall of the logistic regression model is only slightly higher.
  5. The decision tree with post-pruning has the highest accuracy (0.84) and a recall of 0.77. This post-pruned tree is not complex and is easy to interpret.
  6. The decision tree with post-pruning is preferred.

Actionable Insights and Recommendations

Lead time, average price per room, number of special requests, and the online market segment type play a major role in cancellations.